Provide Dockerfile for midstream MPI CUDA image#656
Conversation
📝 WalkthroughWalkthroughAdds a new Dockerfile building a CUDA 13.0 + Python 3.12 + PyTorch 2.9.0 training image. Installs runtime libraries and UCX RPMs, builds OpenMPI with CUDA/UCX support, configures SSH client/server and environment wrappers, installs Python deps via micropipenv plus targeted pins (numpy, scipy, pyarrow, pillow, s3fs), and introduces an uid_entrypoint script and permission changes for OpenShift UID/GID handling. Image runs as user 1001 by default and sets LD_LIBRARY_PATH, PATH, and NVIDIA_REQUIRE_CUDA env vars. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Security findings
Address each finding before approving; provide fixes as minimal, testable changes (configurable known_hosts, reduced writable scope, validated uid handling, verified RPM signatures). 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile`:
- Around line 120-123: The RUN block that uses sed/echo to modify
/etc/ssh/sshd_config (the lines that set "StrictModes" and "Port") weakens SSH
trust by forcing StrictModes no and appending duplicate entries; revert this by
removing the echo that forces "StrictModes no" and ensure sshd_config keeps
StrictModes enabled (do not set StrictModes to "no"), stop appending duplicate
Port/StrictModes lines (only uncomment existing directives with sed), and add
steps to enforce secure filesystem permissions for user homes and SSH keys
(ensure user home directories are not group/world-writable and ~/.ssh is 700
with private keys 600) so the Dockerfile RUN block that edits sshd_config and
sets permissions secures SSH correctly.
- Line 154: The Dockerfile currently grants group-write to installed Python
libraries via the line that runs chmod -R g+w on
/opt/app-root/lib/python3.12/site-packages, which allows unprivileged processes
to modify site-packages; remove that chmod and instead ensure site-packages
ownership and permissions are locked down (e.g., set owner to root and remove
group/other write bits) so only privileged users can modify libraries; locate
the chmod invocation in the Dockerfile (the line touching
/opt/app-root/lib/python3.12/site-packages) and replace it with steps that set
secure ownership (chown root:root) and restrictive permissions (remove g+w and
o+w) for the site-packages directory.
- Around line 114-117: The RUN step in the Dockerfile currently disables SSH
host key verification (the sed edit to /etc/ssh/ssh_config and adding
"UserKnownHostsFile /dev/null"), which must be removed; instead revert the sed
change so StrictHostKeyChecking remains enabled, stop writing UserKnownHostsFile
/dev/null, and populate a trusted host key store (e.g., /etc/ssh/ssh_known_hosts
or per-container ~/.ssh/known_hosts) using ssh-keyscan for cluster peers at
build or startup; ensure the SSH_PORT setting is preserved but do not discard
known_hosts—update the RUN step that edits /etc/ssh/ssh_config and add a step
that gathers and installs host keys before starting SSH to maintain
verification.
- Line 1: The Dockerfile currently installs compilers and -devel build tools
inside a single RUN layer, leaving artifacts in image history; refactor to a
multi-stage build by adding a separate builder stage (e.g., start a new FROM ...
AS builder), move the RUN that installs gcc, gcc-c++, make, perl, and -devel
packages and all build/compile steps into that builder stage, then create a
final stage from the original base image and COPY only the built artifacts
(binaries, wheels, libraries) from the builder stage into the final image;
remove any installation/removal commands from the final stage so build tools are
never present in final layers (reference the existing RUN that installs and
removes build tools in the Dockerfile and replace it with builder-stage steps
and a minimal final stage).
- Around line 32-41: The Dockerfile currently downloads and installs UCX
tarballs/rpms directly (using UCX_VERSION in the RUN block) without integrity
checks; update the UCX download/install steps (and the OpenMPI download lines
noted at 61–63) to verify cryptographic integrity before installation: fetch the
corresponding checksum (e.g., .sha256 or .sha512) and/or signature (.asc),
import and trust the publisher GPG key, verify the signature with gpg or compare
the computed checksum against the published value, and only proceed to
extract/rpm -ivh if verification succeeds; ensure failures abort the build and
remove unverified artifacts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0b5b0db2-fa40-484f-b8c8-c82ba01ff8e1
📒 Files selected for processing (1)
images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile (1)
1-1:⚠️ Potential issue | 🟠 MajorSingle-stage build still violates the mandatory multi-stage policy.
Severity: Major (CWE-250).
Exploit scenario: build compilers/dev tooling are removed from the final filesystem, but remain recoverable from layer history; a registry/layer reader can extract and reuse those artifacts, increasing post-compromise capability.Remediation scaffold
-FROM quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b +FROM quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b AS builder @@ -RUN dnf install -y \ +RUN dnf install -y \ rdma-core libibverbs librdmacm libibumad libmlx5 infiniband-diags \ hwloc libevent pmix \ gcc gcc-c++ make perl \ rdma-core-devel libibverbs-devel librdmacm-devel \ hwloc-devel libevent-devel pmix-devel zlib-devel \ @@ - && dnf remove -y \ - gcc gcc-c++ make perl \ - rdma-core-devel libibverbs-devel librdmacm-devel \ - hwloc-devel libevent-devel pmix-devel zlib-devel \ - && dnf clean all \ + && dnf clean all \ && rm -rf /tmp/openmpi-${OPENMPI_VERSION} /tmp/openmpi.tar.gz + +FROM quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b AS runtime +RUN dnf install -y openssh-server libjpeg-turbo libpng libwebp rdma-core libibverbs librdmacm libibumad libmlx5 infiniband-diags hwloc libevent pmix && dnf clean all +COPY --from=builder /usr/lib64/openmpi /usr/lib64/openmpi +COPY --from=builder /usr/lib64/ucx /usr/lib64/ucx +COPY --from=builder /usr/bin/mpiexec /usr/bin/mpiexec +COPY --from=builder /usr/bin/orted /usr/bin/ortedAs per coding guidelines,
**/Dockerfile*: DOCKERFILE SECURITY (CWE-798, CWE-250):4. Use multi-stage builds.Also applies to: 56-89
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile` at line 1, The Dockerfile currently uses a single-stage build (base image: quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b) which violates the multi-stage build policy; refactor it into a multi-stage Dockerfile by adding a builder stage that contains compilers and dev tooling to perform all build steps, then COPY only the produced runtime artifacts into a minimal final stage (remove build tools from the final image), ensuring final stage starts FROM a slim/runtime base and only copies necessary files from the builder stage to eliminate recoverable build artifacts in image layers (update the Dockerfile references and stage names accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile`:
- Around line 39-45: The Dockerfile currently installs and removes UCX RPMs
using rpm with the --nodeps flag (the rpm -ivh ... /tmp/ucx-*.rpm sequence and
the corresponding rpm -e removal commands), which bypasses dependency checks;
remove the --nodeps option from those rpm invocations so package dependencies
are validated during install/uninstall (ensure any required dependency packages
are added to the Dockerfile or handled via dnf/yum to satisfy them before
installing the UCX RPMs).
- Around line 135-140: The user sshd config (.sshd_config) sets a HostKey that
is never created, so update the container startup (e.g., uid_entrypoint.sh) to
ensure the user .ssh directory exists with correct permissions and to generate a
host key if missing: create the .ssh dir, chmod 700, check for the id_rsa host
key, and if absent run ssh-keygen to create an RSA host key and chmod 600 the
key before launching sshd; ensure this runs prior to starting sshd so the
HostKey entry in .sshd_config points to an existing key.
- Line 143: The Dockerfile currently force-upgrades an unpinned micropipenv (RUN
pip install -U "micropipenv[toml]") and separately installs s3fs outside the
Pipfile.lock; update this by pinning micropipenv to a specific vetted version
(remove the -U behavior) and ensure s3fs is added to the Pipfile.lock (with
hashes) so it is installed via micropipenv's lockfile rather than an ad-hoc pip
install; alternatively, if you must pip-install s3fs in the Dockerfile, install
a pinned s3fs==<version> with a verified hash and remove any --no-deps installs
that bypass lockfile resolution to prevent arbitrary remote wheel execution.
---
Duplicate comments:
In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile`:
- Line 1: The Dockerfile currently uses a single-stage build (base image:
quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b)
which violates the multi-stage build policy; refactor it into a multi-stage
Dockerfile by adding a builder stage that contains compilers and dev tooling to
perform all build steps, then COPY only the produced runtime artifacts into a
minimal final stage (remove build tools from the final image), ensuring final
stage starts FROM a slim/runtime base and only copies necessary files from the
builder stage to eliminate recoverable build artifacts in image layers (update
the Dockerfile references and stage names accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: b9e7dcf5-4c1d-475c-97ef-f8936d299750
📒 Files selected for processing (1)
images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kapil27, sutaakar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Description
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
New Features
Bug Fixes